SDMX, which stands for Statistical Data and Metadata eXchange is an international initiative that aims at standardising and modernising (“industrialising”) the mechanisms and processes for the exchange of statistical data and metadata among international organisations and their member countries
Eurostat, OECD, UN, World Bank, IMF and many others use SDMX in order to transit their datasets.
Using R one can handle these datasets either with the rsdmx library or with specific APIs, like the eurostat library: https://ropengov.github.io/eurostat/
Every dataset from Eurostat can be defined by a unique name. We just need to provide this name to the API, for example:
## Table tin00028 cached at /tmp/RtmpOTDlji/eurostat/tin00028_date_code_TF.rds
## # A tibble: 1,712 x 6
## ind_type unit indic_is geo time values
## <fct> <fct> <fct> <fct> <date> <dbl>
## 1 IND_TOTAL PC_IND I_ILT12 AT 2007-01-01 69
## 2 IND_TOTAL PC_IND I_ILT12 BE 2007-01-01 69
## 3 IND_TOTAL PC_IND I_ILT12 BG 2007-01-01 34
## 4 IND_TOTAL PC_IND I_ILT12 CY 2007-01-01 41
## 5 IND_TOTAL PC_IND I_ILT12 CZ 2007-01-01 52
## 6 IND_TOTAL PC_IND I_ILT12 DE 2007-01-01 75
## 7 IND_TOTAL PC_IND I_ILT12 DK 2007-01-01 85
## 8 IND_TOTAL PC_IND I_ILT12 EA 2007-01-01 61
## 9 IND_TOTAL PC_IND I_ILT12 EE 2007-01-01 66
## 10 IND_TOTAL PC_IND I_ILT12 EL 2007-01-01 36
## # … with 1,702 more rows
It is a good idea, especially if we work with big datasets, to download and save locally the files. For example:
tin00028 <- get_eurostat("tin00028", stringsAsFactors = FALSE)
teina225 <- get_eurostat("teina225", stringsAsFactors = FALSE)
tec00127 <- get_eurostat("tec00127", stringsAsFactors = FALSE) # deficit
tec00115 <- get_eurostat("tec00115", stringsAsFactors = FALSE) # growth
SHP_EU <- get_eurostat_geospatial(output_class = "sf", resolution = 20)
write_rds(tin00028, "../data/tin00028.rds")
write_rds(teina225, "../data/teina225.rds")
write_rds(tec00127, "../data/tec00127.rds")
write_rds(tec00115, "../data/tec00115.rds")
write_rds(SHP_EU, "../data/SHP_EU.rds")Data can be imported as follows:
tin00028 %>%
filter(geo == 'IT' & indic_is == 'I_IUX') %>%
ggplot(aes(x = time, y = values)) +
geom_col(fill = "#336633", width = 200) +
theme_bw() +
theme(text = element_text(size = 11)) +
labs(x = "Year", y = "% of population",
title = "Italians that never used Internet",
subtitle = "% of population of age 16-74")tin00028 %>%
filter(geo %in% c('IT', 'EL') & indic_is == 'I_IUX') %>%
ggplot(aes(x = time, y = values, fill = geo)) +
geom_col(width = 200, position = "dodge") +
theme_bw() +
theme(text = element_text(size = 10)) +
labs(x = "Year", y = "% of population",
title = "Italians and Greeks that never used Internet",
subtitle = "% of population of age 16-74")plotly is js library for interactive graphics. Here is an example how easly can be used:
plt <- tin00028 %>%
filter(geo %in% c('IT', 'EL') & indic_is == 'I_IUX') %>%
ggplot(aes(x = time, y = values, fill = geo)) +
geom_col(width = 200, position = "dodge") +
theme_bw() +
theme(text = element_text(size = 12)) +
labs(x = "Year", y = "% of population",
title = "Italians and Greeks that never used Internet",
subtitle = "% of population of age 16-74")
ggplotly(plt)eurostat API returns data as data frames (tibbles in tidyverse terminology) with the SMDX format preserved. Here is how data look like:
## # A tibble: 1,712 x 6
## ind_type unit indic_is geo time values
## <chr> <chr> <chr> <chr> <date> <dbl>
## 1 IND_TOTAL PC_IND I_ILT12 AT 2007-01-01 69
## 2 IND_TOTAL PC_IND I_ILT12 BE 2007-01-01 69
## 3 IND_TOTAL PC_IND I_ILT12 BG 2007-01-01 34
## 4 IND_TOTAL PC_IND I_ILT12 CY 2007-01-01 41
## 5 IND_TOTAL PC_IND I_ILT12 CZ 2007-01-01 52
## 6 IND_TOTAL PC_IND I_ILT12 DE 2007-01-01 75
## 7 IND_TOTAL PC_IND I_ILT12 DK 2007-01-01 85
## 8 IND_TOTAL PC_IND I_ILT12 EA 2007-01-01 61
## 9 IND_TOTAL PC_IND I_ILT12 EE 2007-01-01 66
## 10 IND_TOTAL PC_IND I_ILT12 EL 2007-01-01 36
## # … with 1,702 more rows
## [1] "tbl_df" "tbl" "data.frame"
## Classes 'tbl_df', 'tbl' and 'data.frame': 1712 obs. of 6 variables:
## $ ind_type: chr "IND_TOTAL" "IND_TOTAL" "IND_TOTAL" "IND_TOTAL" ...
## $ unit : chr "PC_IND" "PC_IND" "PC_IND" "PC_IND" ...
## $ indic_is: chr "I_ILT12" "I_ILT12" "I_ILT12" "I_ILT12" ...
## $ geo : chr "AT" "BE" "BG" "CY" ...
## $ time : Date, format: "2007-01-01" "2007-01-01" ...
## $ values : num 69 69 34 41 52 75 85 61 66 36 ...
Data display in the console is not always enough. A better and more efficiency way to explore data is to use the DT::datatable function. Here is an example:
Eurostat: Internet use by individuals
https://ec.europa.eu/eurostat/data/database?node_code=tin00028
Eurostat: Internet usage by individuals
https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tin00028
https://ec.europa.eu/eurostat/tgm/download.do?tab=table&plugin=1&language=en&pcode=tin00028
but:
What You See is What You Get
Of course, we do not store datasets in this format.
So, there is no reason to use them in this format.
Forget Excel files and use directly SDMX tables. It’s Easy!
That’s easy!
This rowwise procedure, very similar to relational databases and SQL is essential.
We do not perform these actions to the actual data sets, we ARE NOT the Eurostat data administrators.
But as we DO data management and analysis we have to do it sometimes.
So, in every dataset:
Eurostat library provides the table (data frame) eu_countries with the EU (curently) names and ISO 3166-1 two-letter codes:
We will use info from this table in order to exclude data fro other countries in the following slides: we will focus only in the EU-28 member countries.
We will create a new tibble as follows:
EU <- eu_countries %>%
mutate(iso_geo = harmonize_country_code(code)) %>%
select(iso_geo, geo = code, name) %>%
arrange(iso_geo) %>%
as_tibble()Let’s view the result:
When plotting we use colors. It is a good idea to see the available color palettes, provided by the RColorBrewer library:
There is also a way to examine specific colors of a palette. For example here are 5 colors of the RdBu palette:
Or, we can see the hex codes of the colors:
## [1] "#CA0020" "#F4A582" "#F7F7F7" "#92C5DE" "#0571B0"
Table teina225 http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=teina225&lang=en provides the annual general government gross debt:
# debt <- get_eurostat("teina225", stringsAsFactors = FALSE) %>%
# mutate(Year = year(time))
debt <- teina225 %>%
mutate(Year = year(time))
debt## # A tibble: 768 x 7
## unit sector na_item geo time values Year
## <chr> <chr> <chr> <chr> <date> <dbl> <dbl>
## 1 MIO_EUR S13 GD AT 2007-01-01 184674. 2007
## 2 MIO_EUR S13 GD BE 2007-01-01 299997. 2007
## 3 MIO_EUR S13 GD BG 2007-01-01 5297. 2007
## 4 MIO_EUR S13 GD CY 2007-01-01 9462. 2007
## 5 MIO_EUR S13 GD CZ 2007-01-01 39608. 2007
## 6 MIO_EUR S13 GD DE 2007-01-01 1600016. 2007
## 7 MIO_EUR S13 GD DK 2007-01-01 63755 2007
## 8 MIO_EUR S13 GD EA18 2007-01-01 6108606. 2007
## 9 MIO_EUR S13 GD EA19 2007-01-01 6113216. 2007
## 10 MIO_EUR S13 GD EE 2007-01-01 595. 2007
## # … with 758 more rows
For more convinient view of the data:
We look at the structure of the dataset:
## Classes 'tbl_df', 'tbl' and 'data.frame': 768 obs. of 7 variables:
## $ unit : chr "MIO_EUR" "MIO_EUR" "MIO_EUR" "MIO_EUR" ...
## $ sector : chr "S13" "S13" "S13" "S13" ...
## $ na_item: chr "GD" "GD" "GD" "GD" ...
## $ geo : chr "AT" "BE" "BG" "CY" ...
## $ time : Date, format: "2007-01-01" "2007-01-01" ...
## $ values : num 184674 299997 5297 9462 39608 ...
## $ Year : num 2007 2007 2007 2007 2007 ...
and it is a good idea to examine the domain of the values of the columns:
## # A tibble: 2 x 1
## unit
## <chr>
## 1 MIO_EUR
## 2 PC_GDP
## # A tibble: 1 x 1
## sector
## <chr>
## 1 S13
## # A tibble: 1 x 1
## na_item
## <chr>
## 1 GD
One has to refer to Eurostat’s website for more information. Clearly, if we do this a first time we have to carfully read the specifications: what exactly is the numeric value we are looking at.
For the subsequent work we will with debt as percentage of the GDP, so we will be interested in PC_GDP (unit column).
Debt of the general Government as % of GDP, 2018
debt %>%
filter(unit == 'PC_GDP' & Year == 2018) %>%
semi_join(EU, by = "geo") %>%
arrange(values) %>%
mutate(Country = factor(geo, geo)) %>%
ggplot(aes(x = Country, y = values)) +
geom_col(fill = "#41AB5D", width = 0.5) +
scale_y_continuous(breaks = seq(0, 200, by = 25)) +
labs(y = "% GDP") +
coord_flip() +
theme_bw()
Average debt of EU28/EA19 countries
debt of 2017
debt_1 <- debt %>%
inner_join(EU, by = "geo") %>%
filter(unit == 'PC_GDP' & Year == 2017) %>%
arrange(values) %>%
mutate(Country = factor(name, name)) %>%
select(geo, Country, values)
Difference of debt between 2017 and 2018
debt of 2017 and diff from 2018
Difference of debt between 2017 and 2018
debt %>%
inner_join(EU, by = "geo") %>%
filter(unit == 'PC_GDP' & Year == 2018) %>%
arrange(values) %>%
mutate(Country = factor(name, name)) %>%
ggplot(aes(x = Country, y = values)) +
geom_blank() +
annotate("text", x = as.factor("Estonia"), y = debt_EU-8, label = "EU28", color = "blue", fontface = 1, size = 5) +
annotate("text", x = as.factor("Estonia"), y = debt_EA+8, label = "EA19", color = "red", fontface = 1, size = 5) +
geom_col(fill = "#41AB5D", width = 0.2) +
geom_point(color = "#41AB5D", size = 6) +
geom_point(data = debt_1, aes(x = Country, y = values), size = 5, shape = 2, color = "#41AB5D") +
geom_text(data = debt_diff, aes(x = Country, y = -8, label = debt_diff), size = 4) +
geom_hline(yintercept = debt_EU, color = "blue") +
geom_hline(yintercept = debt_EA, color = "red") +
labs(title = "General government gross debt as % of GDP, 2018",
subtitle = "Numbers and open triangles indicate change from 2017",
y = "% of GDP",
caption = "Data source: Eurostat table teina225") +
scale_y_continuous(breaks = seq(0, 200, by = 25)) +
coord_flip() +
theme_minimal() +
theme(panel.border = element_blank()) +
theme(panel.grid.minor = element_blank()) +
theme(panel.grid.major.y = element_blank()) +
theme(text = element_text(size = 16))Shapefiles can be downloaded directly as follows.
SHP_EU_0 <- SHP_EU %>%
filter(LEVL_CODE == 0) %>%
semi_join(EU, by = "geo") %>%
select(NUTS_NAME, geo, geometry) %>%
arrange(geo)
SHP_EU_0## Simple feature collection with 28 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -63.08825 ymin: -21.39077 xmax: 55.83808 ymax: 70.08134
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
## NUTS_NAME geo geometry
## 1 ÖSTERREICH AT MULTIPOLYGON (((15.75363 48...
## 2 BELGIQUE-BELGIË BE MULTIPOLYGON (((4.75993 51....
## 3 БЪЛГАРИЯ BG MULTIPOLYGON (((22.99717 43...
## 4 ΚΥΠΡΟΣ CY MULTIPOLYGON (((32.27382 35...
## 5 ČESKÁ REPUBLIKA CZ MULTIPOLYGON (((14.6188 50....
## 6 DEUTSCHLAND DE MULTIPOLYGON (((9.1131 54.8...
## 7 DANMARK DK MULTIPOLYGON (((10.19436 56...
## 8 EESTI EE MULTIPOLYGON (((25.83016 59...
## 9 ΕΛΛΑΔΑ EL MULTIPOLYGON (((26.03276 40...
## 10 ESPAÑA ES MULTIPOLYGON (((-7.03184 43...
DF <- debt_diff %>%
inner_join(SHP_EU_0, by = "geo") %>%
select(Country, values=debt_diff, geometry) %>%
st_as_sf()
myPal <- colorNumeric(palette = "Spectral", domain = DF$values)
LABELS <- str_c(DF$Country,"<br>", DF$values, " %")
DF %>%
leaflet() %>%
setView(lng = 12.25, lat = 51.34, zoom = 4) %>%
addPolygons(weight = 1, color = "white",
fillColor = ~myPal(values), fillOpacity = 0.5, popup = LABELS) %>%
addLegend("topright", pal = myPal, values = ~values, title = "% of debt change") %>%
addControl(html = "<small>(C) EuroGeographics for the administrative boundaries</small>", position = "bottomleft")DF <- debt_diff %>%
inner_join(SHP_EU_0, by = "geo") %>%
select(Country, values=debt_diff, geometry) %>%
st_as_sf()
LABELS <- str_c(DF$Country,"<br>", DF$values, " %")
DF %>%
leaflet() %>%
setView(lng = 12.25, lat = 51.34, zoom = 4) %>%
addPolygons(weight = 1, color = "black",
fillColor = ~myPal(values), fillOpacity = 0.5, popup = LABELS) %>%
addLegend("topright", pal = myPal, values = ~values, title = "% of debt change") %>%
addControl(html = "<small>(C) EuroGeographics for the administrative boundaries</small>", position = "bottomleft")DF <- debt_diff %>%
inner_join(SHP_EU_0, by = "geo") %>%
select(Country, values=debt_diff, geometry) %>%
st_as_sf()
LABELS <- str_c(DF$Country,"<br>", DF$values, " %")
DF %>%
leaflet() %>%
setView(lng = 12.25, lat = 51.34, zoom = 4) %>%
addPolygons(weight = 1, color = "black",
fillColor = ~myPal(values), fillOpacity = 0.5, popup = LABELS) %>%
addLegend("topright", pal = myPal, values = ~values, title = "% of debt change") %>%
addControl(html = "<small>(C) EuroGeographics for the administrative boundaries</small>", position = "bottomleft")DF <- debt_diff %>%
inner_join(SHP_EU_0, by = "geo") %>%
select(Country, values=debt_diff, geometry) %>%
st_as_sf()
LABELS <- str_c(DF$Country,"<br>", DF$values, " %")
DF %>%
leaflet() %>%
setView(lng = 12.25, lat = 51.34, zoom = 4) %>%
addPolygons(weight = 1, color = "black",
fillColor = ~myPal(values), fillOpacity = 0.5, popup = LABELS) %>%
addLegend("topright", pal = myPal, values = ~values, title = "% of debt change") %>%
addControl(html = "<small>(C) EuroGeographics for the administrative boundaries</small>", position = "bottomleft")DF <- debt_diff %>%
inner_join(SHP_EU_0, by = "geo") %>%
select(Country, values=debt_diff, geometry) %>%
st_as_sf()
LABELS <- str_c(DF$Country,"<br>", DF$values, " %")
DF %>%
leaflet() %>%
setView(lng = 12.25, lat = 51.34, zoom = 4) %>%
addPolygons(weight = 1, color = "black",
fillColor = ~myPal(values), fillOpacity = 0.5, popup = LABELS) %>%
addLegend("topright", pal = myPal, values = ~values, title = "% of debt change") %>%
addControl(html = "<small>(C) EuroGeographics for the administrative boundaries</small>", position = "bottomleft")Change the palette to Quantile colors (4 quantiles) and reverse the sequense of colors.
Change the palette to Quantile colors (5) and reverse the sequense of colors. See the different map below.
DF <- debt_diff %>%
inner_join(SHP_EU_0, by = "geo") %>%
select(Country, values=debt_diff, geometry) %>%
st_as_sf()
LABELS <- str_c(DF$Country,"<br>", DF$values, " %")
DF %>%
leaflet() %>%
setView(lng = 12.25, lat = 51.34, zoom = 4) %>%
addPolygons(weight = 1, color = "black",
fillColor = ~myPal(values), fillOpacity = 0.5, popup = LABELS) %>%
addLegend("topright", pal = myPal, values = ~values, title = "% of debt change") %>%
addControl(html = "<small>(C) EuroGeographics for the administrative boundaries</small>", position = "bottomleft")A qualitive palette, for example Set1 is a very bad choise, See Below.
Any remards?
debt %>%
inner_join(EU, by = "geo") %>%
rename(Country = name) %>%
filter(unit == 'PC_GDP') %>%
filter(geo %in% c('EL')) %>%
ggplot(aes(x = Year, y = values, colour = Country)) +
geom_line(size = 1.2) +
labs(y = "% of GDP", title = "General Goverment Debt") +
theme_bw() +
theme(text = element_text(size = 16)) +
theme(legend.position = "none")debt %>%
inner_join(EU, by = "geo") %>%
rename(Country = name) %>%
filter(unit == 'PC_GDP') %>%
filter(geo %in% c('EL', 'IT')) %>%
ggplot(aes(x = Year, y = values, colour = Country)) +
geom_line(size = 1.2) +
labs(y = "% of GDP", title = "General Goverment Debt") +
theme_bw() +
theme(text = element_text(size = 16)) +
theme(legend.position = c(0.9, 0.2), legend.background = element_rect(fill = "grey90"))debt %>%
inner_join(EU, by = "geo") %>%
rename(Country = name) %>%
filter(unit == 'PC_GDP') %>%
ggplot(aes(x = Year, y = values, colour = Country)) +
geom_line(size = 1.2) +
labs(y = "% of GDP", title = "General Goverment Debt") +
gghighlight(use_group_by = FALSE, geo %in% c('EL', 'IT'), unhighlighted_params=list(colour = "black", size = 0.8)) +
scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
theme_bw() +
theme(text = element_text(size = 16)) ## label_key: Country
debt %>%
inner_join(EU, by = "geo") %>%
rename(Country = name) %>%
filter(unit == 'PC_GDP') %>%
ggplot(aes(x = Year, y = values)) +
geom_line() +
facet_wrap(~Country) +
labs(y = "% of GDP", title = "General Goverment Debt") +
scale_x_continuous(breaks = seq(2008, 2018, by = 4)) +
theme_bw() +
theme(text = element_text(size = 16)) library(geofacet)
eu_grid1[3, "name"] <- "Czechia"
debt %>%
inner_join(EU, by = "geo") %>%
rename(Country = name) %>%
filter(unit == 'PC_GDP') %>%
ggplot(aes(x = Year, y = values)) +
geom_line(color = "#4682B4", size = 1.2) +
facet_geo(~geo, grid = "eu_grid1") +
labs(y = "% of GDP", title = "General Goverment Debt") +
scale_x_continuous(breaks = seq(2008, 2018, by = 4)) +
theme_bw() +
theme(text = element_text(size = 16))
Growth and deficit/surplus of EU countries during 2018
Inspired by Timo Grossenbacher https://timogrossenbacher.ch/2019/04/bivariate-maps-with-ggplot2-and-sf/